In [1]:
# coding: utf-8
import os
from cheshire3.baseObjects import Session
from cheshire3.document import StringDocument
from cheshire3.internal import cheshire3Root
from cheshire3.server import SimpleServer
session = Session()
session.database = 'db_dickens'
serv = SimpleServer(session, os.path.join(cheshire3Root, 'configs', 'serverConfig.xml'))
db = serv.get_object(session, session.database)
qf = db.get_object(session, 'defaultQueryFactory')
resultSetStore = db.get_object(session, 'resultSetStore')
idxStore = db.get_object(session, 'indexStore')
When using the any search function to search for two different terms, the results are wrong.
Problem 1: searching for fog OR dense is not the same as searching for dense OR fog.
Problem 2: the counts for fog OR dense are off. Currently, there are 150 results for fog OR dense and 221 for dense OR fog, but there should be many more (142 or 144 if one counts compound nouns).
In [2]:
# This is the query that is currently being used.
# The count is the number of chapters
query = qf.get_query(session, """
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog") or c3.chapter-idx = "dense")
""")
result_set = db.search(session, query)
print len(result_set)
In [3]:
# To get a more specific count one also needs to include the number of hits
# in the different chapters
def count_total(result_set):
    """
    Helper function to count the total number of hits
    in the search results
    """
    count = 0
    for result in result_set:
        # proxInfo holds one entry per hit within the chapter
        count += len(result.proxInfo)
    return count
In [4]:
count_total(result_set)
Out[4]:
In [5]:
def try_query(query):
    """
    Another helper function to take a query and return
    the total number of hits
    """
    query = qf.get_query(session, query)
    result_set = db.search(session, query)
    return count_total(result_set)
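This query gets wrong results because the OR clause is poorly constructed: the or sits outside the parentheses that hold the subcorpus restriction, so the second term is not limited to the "dickens" subcorpus.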
In [6]:
try_query("""
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or c3.chapter-idx = "fog")
"""
)
Out[6]:
Properly structuring the OR clause removes the problem of getting different results for fog OR dense and dense OR fog.
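The effect of the grouping can be illustrated with plain Python sets. This is only a toy model under stated assumptions: the chapter ids are invented, and set operations stand in for the boolean operators; it does not reproduce how Cheshire3 evaluates a query or counts hits.

```python
# Toy model: each name maps to a set of hypothetical chapter ids.
dickens = {1, 2, 3, 4}   # chapters in the "dickens" subcorpus (invented)
fog = {1, 2, 7}          # chapters containing "fog"; 7 is outside the subcorpus
dense = {2, 3, 8, 9}     # chapters containing "dense"; 8 and 9 are outside

# Poorly grouped: only the first term is restricted to the subcorpus.
fog_or_dense = (dickens & fog) | dense
dense_or_fog = (dickens & dense) | fog

# Properly grouped: the OR is evaluated first, then restricted.
grouped = dickens & (fog | dense)

print(sorted(fog_or_dense))
print(sorted(dense_or_fog))
print(sorted(grouped))
```

In this model the two poorly grouped variants differ because each lets a different term escape the subcorpus restriction, while the properly grouped form gives the same set regardless of term order.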
Option 1
In [7]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo (c3.chapter-idx = "dense" or c3.chapter-idx = "fog"))
"""
)
Out[7]:
Option 2
In [8]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "dense fog")
"""
)
Out[8]:
In [9]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "fog dense")
"""
)
Out[9]:
Option 3: the verbose one
In [10]:
try_query("""
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog"))
"""
)
Out[10]:
To really get the right results, though, one should not just use any (or a plain or), but rather any/cql.proxinfo (or or/proxinfo).
In [11]:
try_query("""
(c3.subcorpus-idx all "dickens" and/proxinfo (c3.chapter-idx = "dense" or/proxinfo c3.chapter-idx = "fog"))
"""
)
Out[11]:
Or in its simpler form:
In [12]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/proxinfo "fog dense")
"""
)
Out[12]:
This does not seem to be affected by whether you include the cql. prefix on the modifier or not (the prefix is a CQL specification, if I am not wrong).
In [13]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/cql.proxinfo "fog dense")
"""
)
Out[13]:
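To avoid retyping the query for each pair of terms, the string could be assembled by a small helper. The sketch below is an assumption, not part of Cheshire3: build_any_query is a hypothetical name, and it only produces the same query text as used above, which can then be passed to try_query. It does no escaping, so terms containing double quotes would need extra care.

```python
def build_any_query(terms, subcorpus="dickens"):
    """Hypothetical helper: build the any/cql.proxinfo query text used above."""
    return (
        '(c3.subcorpus-idx all "{0}" and/cql.proxinfo '
        'c3.chapter-idx any/cql.proxinfo "{1}")'
    ).format(subcorpus, " ".join(terms))


# Reproduces the query text for the terms fog and dense.
query_text = build_any_query(["fog", "dense"])
print(query_text)
```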
The counts are now correct:
In [14]:
dense = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense")""")
print dense
In [15]:
fog = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog")""")
print fog
In [16]:
dense + fog
Out[16]:
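The check above relies on the combined count being equal to the sum of the per-term counts. A minimal sketch of that bookkeeping with plain dictionaries follows; the positions and helper names are invented for illustration, and nothing here describes how Cheshire3 stores proxInfo internally.

```python
# Hypothetical hit positions per chapter id (invented values).
fog_hits = {1: [12, 40], 2: [7]}
dense_hits = {2: [8, 91], 3: [5]}


def merge_hits(left, right):
    """Combine two per-chapter position maps, keeping positions from both."""
    merged = {}
    for hits in (left, right):
        for chapter, positions in hits.items():
            merged.setdefault(chapter, []).extend(positions)
    return merged


def count_positions(hits):
    """Mirror count_total: sum the number of positions over all chapters."""
    return sum(len(positions) for positions in hits.values())


combined = merge_hits(fog_hits, dense_hits)
print(len(combined))              # number of chapters with at least one hit
print(count_positions(combined))  # total number of hits
```

In this model the chapter count and the hit count differ, and the total is the same whichever term is merged first, which is the property the corrected queries are expected to have.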